Abstract

The recently proposed Skip-gram model is a powerful method for learning high-dimensional
word representations that capture rich semantic relationships between words. However,
Skip-gram, like most prior work on learning word representations, does not take word
ambiguity into account and maintains only a single representation per word. Although a
number of Skip-gram modifications have been proposed to overcome this limitation and learn
multi-prototype word representations, they either require a known number of word meanings
or learn them using greedy heuristic approaches. In this paper we propose the Adaptive
Skip-gram model, a nonparametric Bayesian extension of Skip-gram capable of automatically
learning the required number of representations for all words at a desired semantic
resolution. We derive an efficient online variational learning algorithm for the model and
empirically demonstrate its efficiency on the word-sense induction task.


1 Introduction 


Continuous-valued word representations are very useful in many natural language processing
applications. They can serve as input features for higher-level algorithms in a text
processing pipeline and help to overcome the word sparseness of natural texts. Moreover,
they can on their own explain many semantic properties and relationships between concepts
represented by words.


Recently, with the success of deep learning, new methods for learning word representations
inspired by various neural architectures were introduced. Among many others, two particular
models, Continuous Bag of Words (CBOW) and Skip-gram (SG), proposed in (Mikolov et al., 2013a),
were used to obtain high-dimensional distributed representations that capture many semantic
relationships and linguistic regularities (Mikolov et al., 2013a,b). In addition to the high
quality of the learned representations, these models are computationally very efficient and
allow processing text data in an online streaming setting.


However, word ambiguity (which may appear as polysemy, homonymy, etc.), an important property
of natural language, is usually ignored in representation learning methods. For example, the
word “apple” may refer to a fruit or to Apple Inc. depending on the context. Both CBOW and SG
also fail to address this issue since they assume a unique representation for each word. As a
consequence, either the most frequent meaning of the word dominates the others or the meanings
are mixed. Clearly, neither situation is desirable for practical applications.


We address the problem of unsupervised learning of multiple representations that correspond to
different meanings of a word, i.e. building multi-prototype word representations. This may be
considered a specific case of the word sense induction (WSI) problem, which consists in
automatic identification of the meanings of a word. In our case different meanings are
distinguished by separate representations. We define a meaning or sense as a distinguishable
interpretation of the spelled word which may be caused by any kind of ambiguity.

Word-sense induction is closely related to the word-sense disambiguation (WSD) task, where the
goal is to choose which of the meanings of a word provided in the sense inventory was used in
the context. The sense inventory may be obtained by a WSI system or provided as external
information.


Many natural language processing (NLP) applications benefit from the ability to deal with word
ambiguity (Navigli & Crisafulli, 2010; Vickrey et al., 2005). Since word representations have
been used as word features in dependency parsing (Chen & Manning, 2014), named-entity
recognition (Turian et al., 2010) and sentiment analysis (Maas et al., 2011), among many other
tasks, employing multi-prototype representations could increase the performance of such
representation-based approaches.


In this paper we develop a natural extension of the Skip-gram model which we call Adaptive
Skip-gram (AdaGram). It retains all notable properties of SG, such as fast online learning and
high quality of representations, while automatically learning the necessary number of
prototypes per word at a desired semantic resolution.


The rest of the paper is organized as follows: we start by reviewing the original Skip-gram
model (section 2), then we describe our extension called Adaptive Skip-gram (section 3). Then,
we compare our model to existing approaches in section 4. In section 5 we evaluate our model
qualitatively by considering neighborhoods of selected words in the learned latent space and
quantitatively by comparison against competing approaches. Finally, we conclude in section 6.


2 Skip-gram model 


The original Skip-gram model (Mikolov et al., 2013a) is formulated as a set of grouped word
prediction tasks. Each task consists of predicting a word $v$ given a word $w$ using,
correspondingly, their output and input representations

$$p(v \mid w, \theta) = \frac{\exp(in_w^\top out_v)}{\sum_{v'=1}^{V} \exp(in_w^\top out_{v'})}, \tag{1}$$

where the global parameter $\theta = \{in_v, out_v\}_{v=1}^{V}$ stands for both input and output
representations for all words of the dictionary indexed with $1, \dots, V$. Both input and
output representations are real vectors of dimensionality $D$.
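As a rough illustration of equation (1), the following sketch evaluates the full soft-max for a toy dictionary. The vectors and the helper name `softmax_prob` are made up for the example and are not part of the original model description.

```python
import math

def softmax_prob(in_w, out, v):
    """p(v | w) as in equation (1): soft-max over dot products in_w . out_v'."""
    scores = [sum(a * b for a, b in zip(in_w, out_v)) for out_v in out]
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    return exps[v] / sum(exps)

# Toy dictionary with V = 3 words and D = 2 dimensions (illustrative values only).
in_w = [0.5, -0.2]
out = [[0.1, 0.3], [0.4, -0.1], [-0.2, 0.2]]
probs = [softmax_prob(in_w, out, v) for v in range(len(out))]
```

Note that every evaluation touches all $V$ output vectors, which is the linear cost that motivates the hierarchical soft-max.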

These individual predictions are grouped so as to simultaneously predict the context words $y$
of some input word $x$:

$$p(y \mid x, \theta) = \prod_{j} p(y_j \mid x, \theta).$$

The input text $o$ consisting of $N$ words $o_1, o_2, \dots, o_N$ is then interpreted as a
sequence of input words $X = \{x_i\}_{i=1}^{N}$ and their contexts $Y = \{y_i\}_{i=1}^{N}$. Here
the $i$-th training object $(x_i, y_i)$ consists of the word $x_i = o_i$ and its context
$y_i = \{o_t\}_{t \in c(i)}$, where $c(i)$ is a set of indices such that $|t - i| \le C/2$ and
$t \ne i$ for all $t \in c(i)$.$^{1}$
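The construction of training objects can be sketched as follows. This is a minimal illustration that assumes a symmetric window of $C/2$ words on each side and simply truncates the window at the text boundaries; the function name `training_pairs` is hypothetical.

```python
def training_pairs(tokens, C):
    """Return (x_i, y_i) pairs where y_i holds up to C/2 words on each side of x_i."""
    half = C // 2
    pairs = []
    for i, x in enumerate(tokens):
        lo, hi = max(0, i - half), min(len(tokens), i + half + 1)
        context = [tokens[t] for t in range(lo, hi) if t != i]
        pairs.append((x, context))
    return pairs

pairs = training_pairs(["the", "apple", "fell", "from", "the", "tree"], C=2)
```

Boundary words receive shorter contexts here, whereas all non-boundary words receive exactly $C$ context words.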

Finally, the Skip-gram objective function is the likelihood of contexts given the corresponding
input words:

$$p(Y \mid X, \theta) = \prod_{i=1}^{N} p(y_i \mid x_i, \theta) = \prod_{i=1}^{N} \prod_{j=1}^{C} p(y_{ij} \mid x_i, \theta). \tag{2}$$

Note that although contexts of adjacent words intersect, the model assumes the corresponding
prediction problems to be independent.


For training the Skip-gram model it is common to ignore sentence and document boundaries and
to interpret the input data as a stream of words. The objective (2) is then optimized in a
stochastic fashion by sampling the $i$-th word and its context, estimating gradients and
updating the parameters $\theta$. After the model is trained, Mikolov et al. (2013a) treated the
input representations of the trained model as word features and showed that they captured
semantic similarity between concepts represented by the words. In the following we refer to the
input representations as prototypes, following (Reisinger & Mooney, 2010b).

1 For notational simplicity we will further assume that the size of the context is always equal
to $C$, which is true for all non-boundary words.

Both evaluation and differentiation of (2) have linear complexity in the dictionary size $V$,
which is too expensive for practical applications. Because of that, the soft-max prediction
model (1) is substituted by the hierarchical soft-max (Mnih & Hinton, 2008):

$$p(v \mid w, \theta) = \prod_{n \in path(v)} \sigma(ch(n)\, in_w^\top out_n). \tag{3}$$

Here output representations are no longer associated with words, but rather with nodes in a
binary tree whose leaves are all possible words in the dictionary, each with a unique path from
the root to the corresponding leaf. $ch(n)$ assigns either $1$ or $-1$ to each node in $path(v)$
depending on whether $n$ is a left or right child of the previous node in the path. Equation (3)
is guaranteed to sum to 1, i.e. to be a distribution w.r.t. $v$, since
$\sigma(x) = 1/(1 + \exp(-x)) = 1 - \sigma(-x)$. For computational efficiency Skip-gram uses a
Huffman tree to construct the hierarchical soft-max.
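The normalization property of equation (3) can be checked numerically with a small sketch. The tree below is a hypothetical balanced tree over four words, not a Huffman tree, and the sign $\pm 1$ is attached to the branching decision taken at each internal node; both choices are assumptions made only for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def hs_prob(in_w, out_nodes, path):
    """Equation (3): product of sigmoids along the root-to-leaf path of a word.

    path is a list of (node_index, ch) pairs with ch = +1 or -1.
    """
    p = 1.0
    for n, ch in path:
        dot = sum(a * b for a, b in zip(in_w, out_nodes[n]))
        p *= sigmoid(ch * dot)
    return p

# Hypothetical tree over V = 4 words: node 0 is the root, nodes 1 and 2 are its children.
paths = {
    0: [(0, +1), (1, +1)],
    1: [(0, +1), (1, -1)],
    2: [(0, -1), (2, +1)],
    3: [(0, -1), (2, -1)],
}
in_w = [0.3, -0.7]
out_nodes = [[0.2, 0.1], [-0.5, 0.4], [0.9, 0.6]]
total = sum(hs_prob(in_w, out_nodes, paths[v]) for v in paths)
```

Each probability needs only as many dot products as there are nodes on the path, instead of one per dictionary word.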


3 Adaptive Skip-gram 


The original Skip-gram model maintains only one prototype per word. It would be unrealistic to
assume that a single representation can capture the semantics of all possible word meanings.
At the same time, it is non-trivial to specify exactly the right number of prototypes required
for handling the meanings of a particular word. Hence, an adaptive approach for allocating
additional prototypes for ambiguous words is required. Below we describe our Adaptive Skip-gram
(AdaGram) model, which extends the original Skip-gram and can automatically learn the required
number of prototypes for each word using a Bayesian nonparametric approach.

First, assume that each word has $K$ meanings, each associated with its own prototype. That
means that we have to modify (3) to account for the particular choice of the meaning. For this
reason we introduce a latent variable $z$ that encodes the index of the active meaning and
extend (3) to $p(v \mid z = k, w, \theta) = \prod_{n \in path(v)} \sigma(ch(n)\, in_{wk}^\top out_n)$.
Note that we bring even more asymmetry between input and output representations compared to
(3), since now only prototypes depend on the particular word meaning. While it is possible to
make context words meaning-aware as well, this would make the training process much more
complicated. Our experiments show that this word prediction model is enough to capture word
ambiguity. This could be viewed as prediction of context words using meanings of the input
words.


However, setting the number of prototypes equal for all words is not a very realistic
assumption. Moreover, it is desirable that the number of prototypes for a particular word be
determined by the training text corpus. We approach this problem by bringing Bayesian
nonparametrics into the Skip-gram model, i.e. we use the constructive definition of the
Dirichlet process (Ferguson, 1973) for automatic determination of the required number of
prototypes. The Dirichlet process (DP) has been successfully used for infinite mixture modeling
and other problems where the number of structure components (e.g. clusters, latent factors,
etc.) is not known a priori, which is exactly our case.

We use the constructive definition of the DP via the stick-breaking representation
(Sethuraman, 1994) to define a prior over meanings of a word. The meaning probabilities are
computed by dividing the total probability mass into an infinite number of diminishing pieces
summing to 1. So the prior probability of the $k$-th meaning of the word $w$ is

$$p(z = k \mid x = w, \beta) = \beta_{wk} \prod_{r=1}^{k-1} (1 - \beta_{wr}), \qquad p(\beta_{wk} \mid \alpha) = \mathrm{Beta}(\beta_{wk} \mid 1, \alpha), \quad k = 1, \dots$$


This assumes that an infinite number of prototypes may exist for each word. However, as long as
we consider a finite amount of text data, the number of prototypes (those with non-zero prior
probabilities) for word $w$ will not exceed the number of occurrences of $w$ in the text, which
we denote as $n_w$. The hyperparameter $\alpha$ controls the number of prototypes allocated to a
word a priori. Asymptotically, the expected number of prototypes of word $w$ is proportional to
$\alpha \log(n_w)$. Thus, larger values of $\alpha$ produce more prototypes, which leads to more
granular and specific meanings captured by the learned representations, and the number of
prototypes scales logarithmically with the number of occurrences.
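The stick-breaking construction can be illustrated by drawing the stick proportions from $\mathrm{Beta}(1, \alpha)$ and converting them into meaning probabilities. The sketch below truncates the infinite sequence at a hypothetical level `T = 30` and uses an arbitrary seed; it only illustrates the prior and is not the inference procedure.

```python
import random

def stick_breaking(betas):
    """Turn stick proportions beta_1..beta_T into beta_k * prod_{r<k} (1 - beta_r)."""
    probs, remaining = [], 1.0
    for b in betas:
        probs.append(b * remaining)
        remaining *= 1.0 - b
    return probs, remaining  # remaining = prior mass left for meanings beyond T

random.seed(0)
alpha, T = 0.1, 30
betas = [random.betavariate(1.0, alpha) for _ in range(T)]
probs, leftover = stick_breaking(betas)
```

With a small $\alpha$ the draws from $\mathrm{Beta}(1, \alpha)$ tend to be close to 1, so most of the mass is typically taken by the first few meanings.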

Another attractive property of DPs is their ability to increase the complexity of the latent
variables' space as more data arrives. In our model this results in more distinctive meanings
of words being discovered on a larger text corpus.
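The slow growth of the number of components with the amount of data can be made concrete with the standard Dirichlet process result that after $n$ observations the expected number of occupied components is $\sum_{i=0}^{n-1} \alpha / (\alpha + i)$. The sketch below evaluates this sum next to $\alpha \log n$; the function name is hypothetical and the values of $n$ are arbitrary.

```python
import math

def expected_prototypes(alpha, n):
    """Expected number of occupied DP components after n draws."""
    return sum(alpha / (alpha + i) for i in range(n))

# Print the exact expectation next to alpha * log(n) for a few corpus frequencies.
for n in (10**2, 10**3, 10**4):
    print(n, round(expected_prototypes(0.1, n), 3), round(0.1 * math.log(n), 3))
```

The first occurrence always opens a component, so the expectation is at least 1; increasing the frequency a hundredfold then adds well under one expected component for $\alpha = 0.1$.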
Combining all parts together we may write the AdaGram model as follows:

$$p(Y, Z, \beta \mid X, \alpha, \theta) = \prod_{w=1}^{V} \prod_{k=1}^{\infty} p(\beta_{wk} \mid \alpha) \prod_{i=1}^{N} \Big[ p(z_i \mid x_i, \beta) \prod_{j=1}^{C} p(y_{ij} \mid z_i, x_i, \theta) \Big],$$

where $Z = \{z_i\}_{i=1}^{N}$ is a set of senses for all the words. Similarly to Mikolov et al.
(2013a), we do not consider any regularization (and hence no informative prior) for
representations and seek a point estimate of $\theta$.


3.1 Learning representations 


One way to train AdaGram is to maximize the marginal likelihood of the model

$$\log p(Y \mid X, \theta, \alpha) = \log \int \sum_{Z} p(Y, Z, \beta \mid X, \alpha, \theta)\, d\beta \tag{4}$$

with respect to the representations $\theta$. One may see that the marginal likelihood is
intractable because of the latent variables $Z$ and $\beta$. Moreover, $\beta$ and $\theta$ are
infinite-dimensional parameters. Thus, unlike the original Skip-gram and other methods for
learning multiple word representations, our model cannot be straightforwardly trained by
stochastic gradient ascent w.r.t. $\theta$.

To make this tractable we consider the variational lower bound on the marginal likelihood (4)

$$\mathcal{L} = \mathbb{E}_q \left[ \log p(Y, Z, \beta \mid X, \alpha, \theta) - \log q(Z, \beta) \right],$$

where $q(Z, \beta) = \prod_{i=1}^{N} q(z_i) \prod_{w=1}^{V} \prod_{k=1}^{T} q(\beta_{wk})$ is the
fully factorized variational approximation to the posterior $p(Z, \beta \mid X, Y, \alpha, \theta)$
with the possible number of representations for each word truncated to $T$ (Blei & Jordan,
2005). It may be shown that maximization of the variational lower bound with respect to
$q(Z, \beta)$ is equivalent to minimization of the Kullback-Leibler divergence between
$q(Z, \beta)$ and the true posterior (Jordan et al., 1999).

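Within this approximation the variational lower bound $\mathcal{L}(q(Z), q(\beta), \theta)$ takes
the following form:

$$\mathcal{L}(q(Z), q(\beta), \theta) = \mathbb{E}_q \Big[ \sum_{w=1}^{V} \sum_{k=1}^{T} \big( \log p(\beta_{wk} \mid \alpha) - \log q(\beta_{wk}) \big) + \sum_{i=1}^{N} \Big( \log p(z_i \mid x_i, \beta) - \log q(z_i) + \sum_{j=1}^{C} \log p(y_{ij} \mid z_i, x_i, \theta) \Big) \Big].$$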



Setting derivatives of $\mathcal{L}(q(Z), q(\beta), \theta)$ with respect to $q(Z)$ and $q(\beta)$
to zero yields the standard update equations

$$\log q(z_i = k) = \mathbb{E}_{q(\beta)} \Big[ \log \beta_{x_i,k} + \sum_{r=1}^{k-1} \log(1 - \beta_{x_i,r}) \Big] + \sum_{j=1}^{C} \log p(y_{ij} \mid k, x_i, \theta) + \mathrm{const}, \tag{5}$$

$$\log q(\beta) = \sum_{w=1}^{V} \sum_{k=1}^{T} \log \mathrm{Beta}(\beta_{wk} \mid a_{wk}, b_{wk}), \tag{6}$$

where the (natural) parameters $a_{wk}$ and $b_{wk}$ deterministically depend on the expected
number of assignments to a particular sense, $n_{wk} = \sum_{i: x_i = w} q(z_i = k)$
(Blei & Jordan, 2005):

$$a_{wk} = 1 + n_{wk}, \qquad b_{wk} = \alpha + \sum_{r=k+1}^{T} n_{wr}.$$
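Updates (5) and (6) can be sketched for a single word as follows. The sketch relies on the standard identities $\mathbb{E}[\log \beta] = \psi(a) - \psi(a + b)$ and $\mathbb{E}[\log(1 - \beta)] = \psi(b) - \psi(a + b)$ for $\beta \sim \mathrm{Beta}(a, b)$, where $\psi$ is the digamma function (approximated here by a short series). The counts and the per-sense context log-likelihoods are invented numbers, and all function names are hypothetical.

```python
import math

def digamma(x):
    """Digamma via the recurrence psi(x) = psi(x + 1) - 1/x and an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    f = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x - f * (1.0 / 12 - f * (1.0 / 120 - f / 252))

def beta_params(counts, alpha):
    """Equation (6): a_wk = 1 + n_wk, b_wk = alpha + sum_{r > k} n_wr."""
    a = [1.0 + n for n in counts]
    b = [alpha + sum(counts[k + 1:]) for k in range(len(counts))]
    return a, b

def update_q_z(counts, alpha, context_loglik):
    """Equation (5): normalized q(z_i = k) for one occurrence of the word."""
    a, b = beta_params(counts, alpha)
    logits, acc = [], 0.0  # acc accumulates E[log(1 - beta_wr)] for r < k
    for k in range(len(counts)):
        e_log_beta = digamma(a[k]) - digamma(a[k] + b[k])
        logits.append(e_log_beta + acc + context_loglik[k])
        acc += digamma(b[k]) - digamma(a[k] + b[k])
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]
    s = sum(weights)
    return [w / s for w in weights]

# context_loglik[k] stands for sum_j log p(y_ij | k, x_i, theta); values are made up.
q = update_q_z(counts=[90.0, 10.0, 0.0], alpha=0.1, context_loglik=[-12.0, -9.0, -9.5])
```

The constant in (5) is handled by normalizing the exponentiated scores, which is done in a numerically stable way by subtracting the maximum first.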


Stochastic variational inference. Although the variational updates given by (5) and (6) are
tractable, they require a full pass over the training data. In order to keep the efficiency of
the Skip-gram training procedure, we employ the stochastic variational inference approach
(Hoffman et al., 2013) and derive an online optimization algorithm for the maximization of
$\mathcal{L}$. There are two groups of parameters in our objective: $\{q(\beta_{vk})\}$ and
$\theta$ are global because they affect all the objects; $\{q(z_i)\}$ are local, i.e. affect only
the corresponding object $x_i$. After updating the local parameters according to (5) with the
global parameters fixed and denoting the obtained distribution as $q^*(Z)$, we have a function
of the global parameters
$\mathcal{L}^*(q(\beta), \theta) = \mathcal{L}(q^*(Z), q(\beta), \theta) \ge \mathcal{L}(q(Z), q(\beta), \theta)$.


The new lower bound $\mathcal{L}^*$ is no longer a function of the local parameters, which are
always kept updated to their optimal values. Following Hoffman et al. (2013), we iteratively
optimize $\mathcal{L}^*$ with respect to the global parameters using a stochastic gradient
estimated at a single object. The stochastic gradient w.r.t. $\theta$ computed on the $i$-th
object is as follows:

$$\widehat{\nabla}_\theta \mathcal{L}^* = N \sum_{j=1}^{C} \sum_{k=1}^{T} q^*(z_i = k)\, \nabla_\theta \log p(y_{ij} \mid k, x_i, \theta).$$
Now we describe how to optimize $\mathcal{L}^*$ w.r.t. the global posterior approximation
$q(\beta) = \prod_{w=1}^{V} \prod_{k=1}^{T} q(\beta_{wk})$. The stochastic gradient with respect
to the natural parameters $a_{wk}$ and $b_{wk}$ according to (Hoffman et al., 2013) can be
estimated by computing intermediate values of the natural parameters
$(\hat{a}_{wk}, \hat{b}_{wk})$ on the $i$-th data point as if we estimated $q(z)$ for all
occurrences of $x_i = w$ equal to $q(z_i)$:

$$\hat{a}_{wk} = 1 + n_w q(z_i = k), \qquad \hat{b}_{wk} = \alpha + \sum_{r=k+1}^{T} n_w q(z_i = r),$$

where $n_w$ is the total number of occurrences of word $w$. The stochastic gradient estimate can
then be expressed in the following simple form:
$\widehat{\nabla}_{a_{wk}} \mathcal{L}^* = \hat{a}_{wk} - a_{wk}$,
$\widehat{\nabla}_{b_{wk}} \mathcal{L}^* = \hat{b}_{wk} - b_{wk}$. One may see that making such a
gradient update is equivalent to updating the counts $n_{wk}$, since they are sufficient
statistics of $q(\beta_{wk})$.
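The equivalence between a gradient step on the natural parameters and an update of the counts can be checked with a short sketch. It assumes a plain step $a \leftarrow a + \lambda (\hat{a} - a)$ with learning rate $\lambda$; the function names and numbers are illustrative.

```python
def sgd_step_counts(n_wk, q_zi, n_w, lam):
    """One stochastic step written directly on the sense counts of word w."""
    return [(1.0 - lam) * n + lam * n_w * q for n, q in zip(n_wk, q_zi)]

def sgd_step_a(n_wk, q_zi, n_w, lam):
    """The same step written on a_wk = 1 + n_wk using the gradient a_hat - a."""
    a = [1.0 + n for n in n_wk]
    a_hat = [1.0 + n_w * q for q in q_zi]
    return [ak + lam * (ah - ak) for ak, ah in zip(a, a_hat)]

n_wk, q_zi = [100.0, 0.0, 0.0], [0.2, 0.7, 0.1]
new_counts = sgd_step_counts(n_wk, q_zi, n_w=100, lam=0.025)
new_a = sgd_step_a(n_wk, q_zi, n_w=100, lam=0.025)
```

Both routines give the same result up to the constant offset of 1, and the updated counts still sum to $n_w$ because $q(z_i)$ sums to 1.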

We use a conservative initialization strategy for $q(\beta)$, starting with only one allocated
meaning for each word, i.e. $n_{w1} = n_w$ and $n_{wk} = 0$ for $k > 1$. Representations are
initialized with random values drawn from $\mathrm{Uniform}(-0.5/D, 0.5/D)$. In our experiments
we updated both learning rates $\rho$ and $\lambda$ using the same linear schedule from 0.025
to 0.
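A minimal sketch of this initialization and learning-rate schedule is given below. The helper names, the seed and the step-based parameterization of the schedule are assumptions made for illustration.

```python
import random

def init_vector(D, rng):
    """Draw a D-dimensional vector with entries from Uniform(-0.5/D, 0.5/D)."""
    return [rng.uniform(-0.5 / D, 0.5 / D) for _ in range(D)]

def init_counts(n_w, T):
    """Conservative initialization: all n_w occurrences go to the first meaning."""
    return [float(n_w)] + [0.0] * (T - 1)

def linear_rate(step, total_steps, start=0.025):
    """Learning rate decayed linearly from `start` at step 0 to 0 at the last step."""
    return start * (1.0 - step / total_steps)

rng = random.Random(0)
vec = init_vector(300, rng)
counts = init_counts(n_w=7, T=3)
```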

The resulting learning algorithm (Algorithm 1) may also be interpreted as an instance of the
stochastic variational EM algorithm. Like the Skip-gram learning procedure, it has linear
computational complexity in the length of the text $o$. The overhead of maintaining variational
distributions is negligible compared to dealing with representations, and thus training of
AdaGram is $T$ times slower than Skip-gram.

3.2 Disambiguation and prediction 

After the model is trained on data $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$, it can be used to
infer the meanings of an input word $x$ given its context $y$. The predictive probability of a
meaning can be computed as

$$p(z = k \mid x, \mathcal{D}, \theta, \alpha) \propto \int p(z = k \mid \beta, x)\, q(\beta)\, d\beta, \tag{7}$$

where $q(\beta)$ serves as an approximation of $p(\beta \mid \mathcal{D}, \theta, \alpha)$. Since
$q(\beta)$ has the form of independent Beta distributions whose parameters are given in
sec. 3.1, the integral can be taken analytically. The number of learned prototypes for a word
$w$ may be computed as $\sum_{k=1}^{T} \mathbb{1}[p(z = k \mid w, \mathcal{D}, \theta, \alpha) > \epsilon]$,
where $\epsilon$ is a threshold, e.g. $10^{-3}$.
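Because the factors of $q(\beta)$ are independent Beta distributions, the integral in (7) reduces to a product of Beta means, $\mathbb{E}[\beta_k] = a_k / (a_k + b_k)$ and $\mathbb{E}[1 - \beta_r] = b_r / (a_r + b_r)$. The sketch below evaluates it and counts the prototypes above the threshold; the parameter values and function names are illustrative only.

```python
def expected_sense_probs(a, b):
    """E_q[beta_k * prod_{r<k} (1 - beta_r)] for independent Beta(a_k, b_k) factors."""
    probs, remaining = [], 1.0
    for ak, bk in zip(a, b):
        probs.append(remaining * ak / (ak + bk))
        remaining *= bk / (ak + bk)
    total = sum(probs)
    return [p / total for p in probs]  # renormalize over the T truncated senses

def num_prototypes(probs, eps=1e-3):
    """Number of senses whose predictive probability exceeds the threshold eps."""
    return sum(1 for p in probs if p > eps)

# Hypothetical Beta parameters of one word with T = 3.
a, b = [91.0, 11.0, 1.0], [10.1, 0.1, 0.1]
probs = expected_sense_probs(a, b)
learned = num_prototypes(probs)
```

In this example the third sense falls below the threshold $10^{-3}$, so only two prototypes are counted as learned.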

The probability of each meaning of $x$ given context $y$ is thus given by

$$p(z = k \mid x, y, \theta) \propto p(y \mid x, k, \theta) \int p(k \mid \beta, x)\, q(\beta)\, d\beta. \tag{8}$$
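Disambiguation according to (8) amounts to reweighting the prior sense probabilities by the context likelihood under each sense. A minimal sketch, assuming strictly positive priors and invented log-likelihood values (the function name is hypothetical):

```python
import math

def sense_posterior(prior, context_loglik):
    """Equation (8): p(z = k | x, y) proportional to prior_k * p(y | x, k)."""
    logits = [math.log(p) + ll for p, ll in zip(prior, context_loglik)]
    m = max(logits)  # work in log space to avoid underflow
    weights = [math.exp(l - m) for l in logits]
    s = sum(weights)
    return [w / s for w in weights]

prior = [0.9, 0.1]
post = sense_posterior(prior, context_loglik=[-15.0, -10.0])
best = max(range(len(post)), key=post.__getitem__)
```

Here the rarer second sense is selected because the context is far more likely under it, which outweighs its lower prior probability.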

Now the posterior predictive over context words $y$ given input word $x$ may be expressed as

$$p(y \mid x, \mathcal{D}, \theta, \alpha) = \int \sum_{z=1}^{T} p(y \mid x, z, \theta)\, p(z \mid \beta, x)\, q(\beta)\, d\beta. \tag{9}$$


4 Related work 


The literature on learning continuous-space representations, or embeddings, of words is vast,
therefore we concentrate on the works that are most relevant to our approach.

In (Huang et al., 2012; Reisinger & Mooney, 2010a,b) various neural network-based methods for
learning multi-prototype representations are proposed. These methods include clustering
contexts for all words as a preprocessing or intermediate step. While this allows learning
multiple prototypes per word, clustering a large number of contexts brings serious
computational overhead and limits these approaches to the offline setting.

Recently, various modifications of Skip-gram were proposed to learn multi-prototype
representations. Proximity-Ambiguity Sensitive Skip-gram (Qiu et al., 2014) maintains
individual representations for different parts of speech (POS) of the same word. While this may
handle word ambiguity to some extent, there could clearly be many meanings even for the same
part of speech of a word, and these remain undiscovered by this approach.


The work of Tian et al. (2014) can be considered a parametric form of our model with the number
of meanings for each word fixed. Their model also provides an improvement over the original
Skip-gram, but it is not clear how to set the number of prototypes. Our approach not only
allows efficient learning of the required number of prototypes for ambiguous words, but is also
able to gradually increase the number of meanings when more data becomes available, thus
distinguishing between shades of the same meaning.

Figure 1: Left: Distribution of the number of word meanings learned by the AdaGram model for
different values of the parameter $\alpha$. For the number of meanings $k$ we plot
$\log_{10}(n_k + 1)$, where $n_k$ is the number of words with $k$ meanings. Right: All words in
the dictionary were divided into 30 bins according to the logarithm of their frequency. Here we
plot the number of learned prototypes averaged over each such bin.

It is also possible to incorporate external knowledge about word meanings into Skip-gram in the
form of a sense inventory (Chen et al., 2014). First, single-prototype representations are
pre-trained with the original Skip-gram. Afterwards, meanings provided by the WordNet lexical
database are used to learn multi-prototype representations for ambiguous words. The dependency
on external high-quality linguistic resources such as WordNet makes this approach inapplicable
to languages lacking such databases. In contrast, our model does not use any form of
supervision and learns the sense inventory automatically from raw text.


The recent work of Neelakantan et al. (2014), which proposes Multi-sense Skip-gram (MSSG) and
its nonparametric (not in the sense of Bayesian nonparametrics) version (NP-MSSG), is the prior
art closest to AdaGram. While MSSG defines the number of prototypes a priori, similarly to
(Tian et al., 2014), NP-MSSG features automatic discovery of multiple meanings for each word.
In contrast to our approach, learning for NP-MSSG is defined as an ad-hoc greedy procedure that
allocates a new representation for a word if the existing ones explain its context below some
threshold. AdaGram instead follows a more principled nonparametric Bayesian approach.


5 Experiments 


In this section we empirically evaluate our model in a number of different tests. First, we
demonstrate learned multi-prototype representations on several example words. We investigate
how different values of $\alpha$ affect the number of learned prototypes, which we call the
semantic resolution of a model. Then we evaluate our approach on the word sense induction (WSI)
task. We also provide more experiments in the supplementary material.


In order to evaluate our method we trained several models with different values of $\alpha$ on
the April 2010 snapshot of English Wikipedia (Shaoul & Westbury, 2010). It contains nearly 2
million articles and 990 million tokens. We did not consider words which have fewer than 20
occurrences. The context width was set to $C = 10$ and the truncation level of the
stick-breaking approximation (the maximum number of meanings) to $T = 30$. The dimensionality
$D$ of representations learned by our model was set to 300 to match the dimensionality of the
models we compare with.


5.1 Nearest neighbours of learned prototypes 

In Table 4 (see appendix) we present the meanings which were discovered by our model with
parameter $\alpha = 0.1$ for words used in (Neelakantan et al., 2014) and for a few other sample
words. To distinguish the meanings we obtain their nearest neighbours by computing the cosine
similarity between each meaning prototype and the prototypes of meanings of all other words.
One may see that the AdaGram model learns a reasonable number of prototypes which are
meaningful and interpretable. The predictive probability of each meaning reflects how
frequently it was used in the training corpus.
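The nearest-neighbour lookup described above can be sketched as follows. The two-dimensional prototype vectors are invented for the example, and the dictionary layout keyed by `(word, sense)` is an assumption, not the format used in the experiments.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def nearest(query, prototypes, top=2):
    """Rank prototypes of all other words by cosine similarity to the query prototype."""
    word, _ = query
    scored = [(cosine(prototypes[query], vec), key)
              for key, vec in prototypes.items() if key[0] != word]
    return [key for _, key in sorted(scored, reverse=True)[:top]]

prototypes = {  # made-up vectors keyed by (word, sense index)
    ("light", 1): [1.0, 0.1], ("light", 2): [0.1, 1.0],
    ("bright", 1): [0.9, 0.2], ("armoured", 1): [0.2, 0.9],
}
first = nearest(("light", 1), prototypes, top=1)
second = nearest(("light", 2), prototypes, top=1)
```

Prototypes of the query word itself are excluded, so each sense is described only through the senses of other words.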
Table 1: Nearest neighbours of different prototypes of the words “light” and “core” learned by
AdaGram under different values of $\alpha$, and the corresponding predictive probabilities.

ALPHA      “LIGHT”                              “CORE”
           p(z)  nearest neighbours             p(z)  nearest neighbours

Skip-gram  1.00  far-red, emitting              1.00  cores, component, i7

0.05       1.00  far-red, illumination          0.40  corium, cores, sub-critical
                                                0.60  basic, i7, standards-based

0.075      0.28  armoured, amx-13, kilcrease    0.30  competencies, curriculum
           0.72  bright, sunlight, luminous     0.34  cpu, cores, i7, powerxcell
                                                0.36  nucleus, backbone

0.1        0.09  tvarbanan, hudson-bergen       0.21  reactor, hydrogen-rich
           0.17  dark, bright, green            0.13  intel, processors
           0.09  4th, dragoons, 2nd             0.27  curricular, competencies
           0.26  radiation, ultraviolet         0.15  downtown, cores, center
           0.28  darkness, shining, shadows     0.24  nucleus, rag-tag, roster
           0.11  self-propelled, armored


Table 2: Test log-likelihood under different $\alpha$ on a sample from Wikipedia (see sec. 5.3)
and Adjusted Rand Index (ARI) on the training part of the WWSI dataset (see sec. 5.4).

MODEL                          LOG-LIKELIHOOD  ARI

Skip-Gram.300D                 -7.403          -
Skip-Gram.600D                 -7.387          -
AdaGram.300D $\alpha$ = 0.05   -7.399          0.007
AdaGram.300D $\alpha$ = 0.1    -7.385          0.226
AdaGram.300D $\alpha$ = 0.15   -7.382          0.268
AdaGram.300D $\alpha$ = 0.2    -7.378          0.254
AdaGram.300D $\alpha$ = 0.25   -7.375          0.250
AdaGram.300D $\alpha$ = 0.5    -7.387          0.230


For most of the words, $\alpha = 0.1$ results in the most interpretable model. It seems that for
values less than 0.1 only one prototype is learned for most words, while for values greater
than 0.1 the model becomes less interpretable, as the learned meanings are too specific and
sometimes duplicate each other.

5.2 Semantic resolution 

As mentioned in section 3, the hyperparameter $\alpha$ of the AdaGram model indirectly controls
the number of induced word meanings. Figure 1 (left) shows the distribution of the number of
induced word meanings under different values of $\alpha$. One may see that while for most words
a relatively small number of meanings is learned, larger values of $\alpha$ lead to more
meanings in general. This effect may be explained by the property of the Dirichlet process to
allocate a number of prototypes that depends logarithmically on the number of word occurrences.
Since word occurrences are known to follow Zipf's law, the majority of words are rather
infrequent and thus our model discovers few meanings for them. Figure 1 (right) quantitatively
demonstrates this phenomenon.

In Table 1 we demonstrate how larger values of $\alpha$ lead to more meanings, using the word
“light” as an example. The original Skip-gram discovered only the meaning related to a physical
phenomenon; AdaGram with $\alpha = 0.075$ found a second, military meaning; with a further
increase of $\alpha$ those meanings start splitting into submeanings, e.g. light tanks and light
troops. Similar results are provided for the word “core”.

5.3 Word prediction 

Since both Skip-gram and AdaGram are defined as models for predicting the context of a word, it
is essential to evaluate how well they explain test data in terms of predictive likelihood. We
use the last 200 megabytes of the December 2014 snapshot of English Wikipedia as test data for
this experiment.

Similarly to the training procedure, we consider this text as pairs of input words and contexts
of size $C = 10$, that is, $\mathcal{D}_{test} = \{(x_i, y_i)\}_{i=1}^{N}$, and compare AdaGram
with the original Skip-gram by average log-likelihood (see sec. 3.2). We were unable to include
MSSG and NP-MSSG in the comparison as these models do not estimate conditional word likelihood.
The results are given in Table 2. Clearly, AdaGram models text data better than Skip-gram under
a wide range of values of $\alpha$.

Figure 2: Recall (left) and precision (right) for Task 11 of the SemEval-2013 competition. $K$
is the position of the snippet in the search result page, $r$ is the recall value; see text for
details.


Table 3: Adjusted Rand index (ARI) for the word sense induction task on different datasets.
Here we use the test subset of the WWSI dataset. See sec. 5.4 for details.

MODEL                           SE-2007  SE-2010  SE-2013  WWSI

MSSG.300D.30K                   0.048    0.085    0.033    0.194
NP-MSSG.50D.30K                 0.031    0.058    0.023    0.163
NP-MSSG.300D.6K                 0.033    0.044    0.033    0.110
MPSG.300D                       0.044    0.077    0.014    0.160
AdaGram.300D.$\alpha$ = 0.15    0.069    0.097    0.061    0.286

Since AdaGram has more parameters than Skip-Gram with the same dimensionality of
representations, it is natural to compare its efficiency with a Skip-Gram that has a comparable
number of parameters. The model with $\alpha = 0.15$, which we study extensively below, has
approximately 2 learned prototypes per word on average, so we doubled the dimensionality of
Skip-Gram and included it in the comparison as well.$^{2}$ One may see that AdaGram with
$\alpha$ equal to 0.15 outperforms the 600-dimensional Skip-Gram, and so does the model with
$\alpha = 0.1$.


5.4 Word-sense induction 


The nonparametric learning of a multi-prototype representation model is closely related to the
word-sense induction (WSI) task, which aims at automatic discovery of different meanings of
words. Indeed, learned prototypes identify different word meanings and it is natural to assess
how well they are aligned with human judgements.


We compare our AdaGram model with the Nonparametric Multi-sense Skip-gram (NP-MSSG) proposed by
Neelakantan et al. (2014), which is currently the only existing approach to learning
multi-prototype word representations with Skip-gram. We also include in the comparison the
parametric form of NP-MSSG, which has the number of meanings fixed to 3 for all words during
training. All models were trained on the same dataset, the Wikipedia snapshot by Shaoul &
Westbury (2010). For the comparison with MSSG and NP-MSSG we used the source code and models
released by the authors. Neelakantan et al. (2014) limited the number of words for which
multi-prototype representations were learned (the 30000 and 6000 most frequent words) for these
models. We use the following notation: 300D or 50D is the dimensionality of word
representations, 6K or 30K is the number of multi-prototype words (6000 and 30000 respectively)
in the case of MSSG and NP-MSSG models. Another baseline is the Multi-prototype Skip-Gram
(MPSG) proposed by Tian et al. (2014), which can be seen as a special case of AdaGram with the
number of senses fixed. We trained this model similarly to (Tian et al., 2014), setting the
number of senses for each word equal to 3.


The evaluation is performed as follows. A dataset consisting of pairs of a target word and its
context is supplied to a model, which uses the context to disambiguate the target word into a
meaning from its learned sense inventory. Then, for each target word, the model's labeling of
contexts and the ground-truth labeling are compared as two different clusterings of the same
set using appropriate metrics. The results are then averaged over all target words.


Data We consider several WSI datasets in our experiments. The SemEval-2007 dataset was
introduced for the SemEval-2007 Task 2 competition; it contains 27232 contexts collected from
the Wall Street Journal (WSJ) corpus. The SemEval-2010 dataset was similarly collected for the
SemEval-2010 Task 14 competition and contains 8915 contexts in total, part obtained from web
pages returned by a search engine and the other part from news articles. We also consider the
SemEval-2013 Task 13 dataset consisting of 4664 contexts (we considered only single-term words
in this dataset).

2 Also note that the 600-dimensional Skip-Gram has twice as many parameters in the hierarchical
softmax as the 300-dimensional AdaGram.

In order to make the evaluation more comprehensive, we introduce the new Wikipedia Word-sense
Induction (WWSI) dataset consisting of 188 target words and 36354 contexts. To the best of our
knowledge it is currently the largest WSI dataset available. While the SemEval datasets were
prepared by hand by experts who mapped contexts into a gold standard sense inventory, we
collected WWSI using a fully automatic approach from the December 2014 snapshot of Wikipedia.
The dataset is split evenly into train and test parts. More details on the dataset construction
procedure are provided in the supplementary material, sec. 3.

For all SemEval datasets we merged the train and test contexts and used them for model
comparison. Each model was supplied with contexts of the size that maximizes its ARI
performance.


Metrics The authors of the SemEval dataset, Manandhar et al. (2010), suggested two metrics for model
comparison: V-Measure (VM) and F-Score (FS). They also pointed out the weaknesses of both VM and
FS. VM favours a large number of clusters and attains large values on unreasonable clusterings which
assign each instance to its own cluster, while FS is biased towards clusterings consisting of a small
number of clusters, e.g. assigning every instance to the same single cluster. We therefore consider
another metric, the adjusted Rand index (ARI) (Hubert & Arabie, 1985), which does not suffer from these
drawbacks. Both examples of undesirable clusterings described above get an ARI of nearly zero,
which corresponds to human intuition. Thus we consider ARI a more reliable metric for WSI
evaluation. We still report VM and FS values in the suppl. material (sec. 4) in order to make our
results comparable to others obtained on these datasets.
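
The behaviour of ARI on the two degenerate clusterings can be checked with a small sketch. The code below is an illustration only, not the evaluation script used for the reported results: it implements the standard pair-counting form of ARI, and the function name `adjusted_rand_index` and the toy gold labels are assumptions made for the example.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(gold, pred):
    """Pair-counting adjusted Rand index between two hard clusterings."""
    n = len(gold)
    if n < 2:
        return 1.0  # convention for degenerate input (assumption)
    # Number of instance pairs placed together in both, gold, and predicted clusterings.
    pairs_joint = sum(comb(c, 2) for c in Counter(zip(gold, pred)).values())
    pairs_gold = sum(comb(c, 2) for c in Counter(gold).values())
    pairs_pred = sum(comb(c, 2) for c in Counter(pred).values())
    expected = pairs_gold * pairs_pred / comb(n, 2)  # agreement expected by chance
    max_index = 0.5 * (pairs_gold + pairs_pred)
    if max_index == expected:
        return 1.0  # both clusterings are trivial and identical
    return (pairs_joint - expected) / (max_index - expected)


# Toy gold standard: three senses over eight contexts (made-up data).
gold = [0, 0, 0, 1, 1, 1, 2, 2]
one_cluster_per_instance = list(range(len(gold)))  # the clustering favoured by VM
single_cluster = [0] * len(gold)                   # the clustering favoured by FS

ari_singletons = adjusted_rand_index(gold, one_cluster_per_instance)
ari_single = adjusted_rand_index(gold, single_cluster)
ari_perfect = adjusted_rand_index(gold, gold)
print(ari_singletons, ari_single, ari_perfect)
```

On this toy input both degenerate clusterings score zero while the gold clustering itself scores one, which is the property the text relies on.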


Evaluation Since AdaGram is essentially influenced by the hyperparameter α, we first investigate
how different choices of α affect WSI performance on the train part of our WWSI dataset in terms
of ARI, see Table 2. The model with α = 0.15 attains the maximum ARI, and thus we use this model for
all further experiments.

We compare AdaGram against MSSG and NP-MSSG on all datasets described above, see Table 3 for
results. AdaGram consistently outperforms the competing approaches on all datasets and achieves
a significant improvement on the test part of the WWSI dataset. One may see that the nonparametric
version of MSSG delivers consistently worse performance than MSSG with the number of prototypes
fixed to 3. This suggests that the ability to discover different word meanings is rather limited for
NP-MSSG. The fact that AdaGram substantially outperforms NP-MSSG indicates that a more principled
Bayesian nonparametric approach is more suitable for the task of word-sense induction.

One may note that results on the SemEval datasets are consistently lower than results on the WWSI
dataset for all models. We attribute this to the difference between our training corpus and the data
used to prepare the SemEval datasets (such as news articles), as well as to the difference between
sense inventories: SemEval data uses WordNet and OntoNotes as sources of word meanings.


5.5 Web search results diversification 


In this experiment we follow the methodology of the SemEval-2013 Task 11 competition. Systems are
given an ambiguous query and web search result snippets which have to be clustered. The main
goal of this task is to measure the ability of systems to diversify web search results. The authors
of this task (Di Marco & Navigli, 2013) proposed the following two metrics for evaluation.
Subtopic Recall@K measures how many different word meanings (from the gold standard sense
inventory) are covered by the top K diversified search results. Subtopic Precision@r determines the
ratio of different meanings provided in the first $K_r$ results, where $K_r$ is the minimum number of
top results achieving recall r. We consider only single-token words in this comparison. The results are
shown in Figure 2. One may see that the curves of AdaGram lie consistently above the curves
of all competing models, suggesting that AdaGram is better suited to this real-world application.
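
The two metrics can be sketched directly from the definitions above. This is a minimal illustration, not the official task scorer: it assumes each ranked result carries exactly one gold sense label, and the function names, the toy ranking, and the choice to return 0.0 when recall r is never reached are all assumptions made for the example.

```python
def subtopic_recall_at_k(ranked_senses, all_senses, k):
    """Fraction of gold senses covered by the top-k results."""
    return len(set(ranked_senses[:k])) / len(set(all_senses))


def subtopic_precision_at_r(ranked_senses, all_senses, r):
    """Distinct senses covered in the first K_r results, divided by K_r,
    where K_r is the smallest prefix length whose subtopic recall is at least r."""
    total = len(set(all_senses))
    covered = set()
    for rank, sense in enumerate(ranked_senses, start=1):
        covered.add(sense)
        if len(covered) / total >= r:
            return len(covered) / rank
    return 0.0  # recall r is never reached (convention chosen for this sketch)


# Toy diversified ranking for one ambiguous query (made-up labels).
ranking = ["a", "a", "b", "c", "b", "d"]
inventory = ["a", "b", "c", "d"]

recall_at_3 = subtopic_recall_at_k(ranking, inventory, 3)
precision_at_half = subtopic_precision_at_r(ranking, inventory, 0.5)
precision_at_full = subtopic_precision_at_r(ranking, inventory, 1.0)
print(recall_at_3, precision_at_half, precision_at_full)
```

A system that places different meanings early in the ranking reaches a given recall with a smaller $K_r$ and therefore obtains a higher precision at that recall level.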


6 Conclusion 

In this paper we proposed AdaGram, a Bayesian nonparametric extension of the well-known
Skip-gram model. AdaGram uses different prototypes to represent a word depending on
the context and thus may handle various forms of word ambiguity. Our experiments suggest that
representations learned by our model correspond to different word meanings. Using the resolution
parameter α we may control how many prototypes are extracted from the same text corpus. Too
large values of α lead to different prototypes that correspond to the same meaning, which decreases
model performance. Values of α between 0.1 and 0.2 are generally good for practical purposes. For
those values the truncation level T = 30 is enough and does not affect the number of discovered
prototypes. AdaGram also features an online variational learning algorithm which is very scalable
and makes it possible to train our model just several times slower than the extremely efficient
Skip-gram model. Since the problem of learning multi-prototype word representations is closely related
to word-sense induction, we evaluated AdaGram on several WSI datasets and contributed a new
large one obtained automatically from Wikipedia disambiguation pages. The source code of our
implementation, the WWSI dataset and all trained models are available at
http://github.com/sbos/AdaGram.jl


7 Appendix


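Algorithm 1 Training AdaGram model

Input: training data $\{(x_i, y_i)\}_{i=1}^{N}$, hyperparameter $\alpha$
Output: parameters $\theta$, distributions $q(\beta)$, $q(z)$

Initialize parameters $\theta$, distributions $q(\beta)$, $q(z)$
for $i = 1$ to $N$ do
    Select word $w = x_i$ and its context $y_i$
    Local step:
    for $k = 1$ to $T$ do
        $\gamma_{ik} = \mathbb{E}_{q(\beta_w)}[\log p(z_i = k \mid \beta, x_i)]$
        for $j = 1$ to $C$ do
            $\gamma_{ik} \leftarrow \gamma_{ik} + \log p(y_{ij} \mid x_i, k, \theta)$
        end
    end
    $\gamma_{ik} \leftarrow \exp(\gamma_{ik}) / \sum_{\ell} \exp(\gamma_{i\ell})$
    Global step:
    $\rho_t \leftarrow 0.025(1 - i/N)$, $\lambda_t \leftarrow 0.025(1 - i/N)$
    for $k = 1$ to $T$ do
        Update $n_{wk} \leftarrow (1 - \lambda_t) n_{wk} + \lambda_t n_w \gamma_{ik}$
    end
    Update $\theta \leftarrow \theta + \rho_t \nabla_\theta \sum_k \sum_j \gamma_{ik} \log p(y_{ij} \mid x_i, k, \theta)$
end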
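
The local step and the count update of Algorithm 1 can be illustrated with a short sketch. It is not the released implementation: the expected log prior over senses and the per-context log-likelihoods are taken as given inputs, the gradient update of the parameters is omitted, and the function names are hypothetical. Subtracting the maximum score before exponentiation is an addition for numerical stability and does not change the normalized result.

```python
import math


def local_step(expected_log_prior, context_log_liks):
    """Posterior sense responsibilities for one word occurrence.

    expected_log_prior[k]  : E_q[log p(z_i = k | beta, x_i)]  (assumed given)
    context_log_liks[k][j] : log p(y_ij | x_i, k, theta)      (assumed given)
    """
    scores = [p + sum(ll) for p, ll in zip(expected_log_prior, context_log_liks)]
    top = max(scores)  # stabilizes the exponentials below
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def step_size(i, n_total, base=0.025):
    """Linearly decaying rate 0.025 * (1 - i / N), used for both rates in Algorithm 1."""
    return base * (1.0 - i / n_total)


def update_counts(n_wk, n_w, gamma, lam):
    """Stochastic update n_wk <- (1 - lam) * n_wk + lam * n_w * gamma_k."""
    return [(1.0 - lam) * n + lam * n_w * g for n, g in zip(n_wk, gamma)]


# Toy example with T = 3 prototypes and C = 2 context words (made-up numbers).
prior = [math.log(0.5), math.log(0.3), math.log(0.2)]
log_liks = [[-1.0, -2.0], [-0.5, -0.5], [-3.0, -3.0]]
gamma = local_step(prior, log_liks)

lam = step_size(i=500, n_total=1000)
new_counts = update_counts([60.0, 30.0, 10.0], n_w=100.0, gamma=gamma, lam=lam)
print(gamma, new_counts)
```

Because the responsibilities sum to one, the update keeps the total of the per-sense counts equal to the word frequency whenever it was equal before the step.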


7.1 Full evaluation on WSI task 

We report V-measure and F-score values in Tables 5 and 6. The results are rather contradictory for
the reasons described in the paper: V-measure prefers a larger number of meanings, while
F-score encourages a small number of meanings. We report these numbers in order to make the values
comparable with other results.


7.2 Full evaluation of SemEval-2013 Task-13 


This task evaluates word sense induction systems by comparing fuzzy clusterings, i.e.
in the gold standard each context may be assigned to several meanings, each with a score indicating
the confidence of the assignment. Two metrics were used for comparing such fuzzy clusterings: Fuzzy
Normalized Mutual Information (Fuzzy-NMI) and Fuzzy-B-Cubed, both introduced in (Jurgens &
Klapaftis, 2013). Fuzzy-NMI measures the alignment of two clusterings and is independent of the
cluster sizes, so it is suitable for measuring how well a model captures rare senses. Fuzzy-B-Cubed,
on the contrary, is sensitive to the cluster sizes, so it reflects the performance of a system on a
dataset where the clusters have almost the same frequency. Results for the MSSG, NP-MSSG and AdaGram
models are shown in Table 7.

However, these measures have drawbacks similar to those of V-Measure and F-score described above.
A trivial solution that assigns a separate sense to each context obtains a high value of Fuzzy-NMI,
while treating each word as single-sense performs well in terms of Fuzzy-B-Cubed. All WSI systems
that participated in this task failed to completely surpass these baselines, according to Table 3 in
(Jurgens & Klapaftis, 2013). Hence we consider the ARI comparison more reliable. Note that since we
excluded multi-token words from the evaluation, the numbers we report are not comparable with other
results obtained on the dataset.

The ARI comparison we report in the paper was done by transforming fuzzy clusterings into hard
ones, i.e. each context was assigned to its most probable meaning.
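
The conversion from a fuzzy to a hard clustering amounts to an argmax per context. A minimal sketch, assuming each context is given as a mapping from sense label to score (the data layout and the function name `harden` are assumptions, and ties are resolved in favour of the first label listed):

```python
def harden(fuzzy_assignments):
    """Map each context's {sense: score} dictionary to its highest-scoring sense."""
    return [max(scores, key=scores.get) for scores in fuzzy_assignments]


# Three contexts with made-up fuzzy sense scores.
fuzzy = [
    {"s1": 0.7, "s2": 0.3},
    {"s1": 0.2, "s2": 0.8},
    {"s2": 1.0},
]
hard = harden(fuzzy)
print(hard)
```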


7.3 WWSI Dataset construction details 

Similarly to (Navigli & Vannella, 2013), we considered Wikipedia’s disambiguation pages as a list
of ambiguous words. From that list we selected target single-term words which had occurred
in the text at least 5000 times, to ensure there are enough training contexts in Wikipedia to
capture different meanings of a word (note, however, that all models were trained on an earlier
snapshot of Wikipedia). We also did not consider pages belonging to some categories, such as
“Letter-number_combination_disambiguation_pages”, as they did not contain meaningful words. Then we
prepared the sense inventory for each word in the list using Wikipedia pages with names matching
the pattern “WORD_(*)”, which is used as a convenient naming convention for specific word meanings.
Again, we applied some automatic filtering to remove names of people and geographical places in order
to obtain more coarse-grained meanings. Finally, for each page selected in the previous step we found
all occurrences of the target word on it and used its 5-word neighbourhood (5 words on the left and
5 words on the right) as a context. This context size was chosen to minimize the intersection
between adjacent contexts while still providing enough words for disambiguation. A 10-word context
results in an average intersection of 1.115 words.

Table 4: Nearest neighbors of meaning prototypes learned by the AdaGram model with α = 0.1. In
the second column we provide the predictive probability of each meaning.

WORD     p(z)   NEAREST NEIGHBOURS
python   0.33   monty, spamalot, cantsin
         0.42   perl, php, java, C++
         0.25   molurus, pythons
apple    0.34   almond, cherry, plum
         0.66   macintosh, iifx, iigs
date     0.10   unknown, birth, birthdate
         0.28   dating, dates, dated
         0.31   to-date, stateside
         0.31   deadline, expiry, dates
bow      0.46   stern, amidships, bowsprit
         0.38   spear, bows, wow, sword
         0.16   teign, coxs, evenlode
mass     0.22   vespers, masses, liturgy
         0.42   energy, density, particle
         0.36   wholesale, widespread
run      0.02   earned, saves, era
         0.35   managed, serviced
         0.26   2-run, ninth-inning
         0.37   drive, go, running, walk
net      0.34   pre-tax, pretax, billion
         0.28   negligible, total, gain
         0.16   fox, est/edt, sports
         0.23   puck, ball, lobbed
fox      0.38   cbs, abc, nbc, espn
         0.14   raccoon, wolf, deer, foxes
         0.33   abc, tv, wonderfalls
         0.14   gardner, wright, taylor
rock     0.23   band, post-hardcore
         0.10   little, big, arkansas
         0.29   pop, funk, r&b, metal, jazz
         0.14   limestone, bedrock
         0.23   ’n’ roll, ‘n’, ’n
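
The context extraction step can be sketched as follows. This is an illustration under simplifying assumptions rather than the script used to build WWSI: the page text is assumed to be already tokenized and lower-cased, matching is by exact token equality, and the function name `extract_contexts` is hypothetical.

```python
def extract_contexts(tokens, target, window=5):
    """Collect up to `window` tokens on each side of every occurrence of `target`."""
    contexts = []
    for i, token in enumerate(tokens):
        if token == target:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            contexts.append(left + right)
    return contexts


# Toy page text (made up) containing two occurrences of the target word.
text = "the bow of the ship rose as the archer drew his bow and aimed at the stern"
contexts = extract_contexts(text.split(), "bow")
for context in contexts:
    print(context)
```

Occurrences near the start or end of a page simply yield shorter contexts, as in the first context of the example.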

Pages belonging to the following categories were excluded during target word selection:

• Place_name_disambiguation_pages

• Disambiguation_pages_with_surname-holder_lists

• Human_name_disambiguation_pages

• Lists_of_ambiguous_numbers

• Disambiguation_pages_with_given-name-holder_lists

• Letter-number_combination_disambiguation_pages

• Two-letter_disambiguation_pages

• Transport_route_disambiguation_pages

• Temple_name_disambiguation_pages

• and also pages from categories whose names contain one of the substrings: “cleanup”,
“people”, “surnames”


During the sense inventory collection we do not consider pages whose names contain one of the
following substrings: “tv_”, “series”, “movie”, “film”, “song”, “album”, “band”, “singer”, “musical”,
“comics”; nor pages from categories with names containing the geography terms “countries”,
“people”, “province”, “provinces”.
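
The title-based part of this filtering can be sketched as below. The code is an assumption-laden illustration: it checks the “WORD_(*)” naming pattern and the excluded title substrings only, it compares case-insensitively (a choice made for the example), and it leaves out the category-based filter because that requires page metadata. The function name `is_sense_page` is hypothetical.

```python
import re

# Title substrings listed in the text above.
EXCLUDED_SUBSTRINGS = (
    "tv_", "series", "movie", "film", "song",
    "album", "band", "singer", "musical", "comics",
)


def is_sense_page(title, word):
    """True if `title` has the form WORD_(qualifier) and contains no excluded substring."""
    pattern = re.escape(word) + r"_\(.+\)"
    if re.fullmatch(pattern, title, flags=re.IGNORECASE) is None:
        return False
    lowered = title.lower()
    return not any(substring in lowered for substring in EXCLUDED_SUBSTRINGS)


# Made-up page titles used only to exercise the filter.
candidates = ["Bow_(ship)", "Bow_(film)", "Bowl_(vessel)", "Bow_(weapon)"]
kept = [title for title in candidates if is_sense_page(title, "bow")]
print(kept)
```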


8 Experiments on contextual word similarity 


In this section we compare AdaGram to other multi-prototype models on the contextual word
similarity task using the SCWS dataset proposed in (Huang et al., 2012). The dataset consists of 2003
pairs of words, each annotated with 10 human judgements of their semantic similarity. The common
evaluation methodology is to average these 10 values for each pair and measure Spearman’s rank
correlation between the result and the similarities obtained using word representations learned by a
model, i.e. the cosine similarity of the corresponding vectors.


There are two measures of word similarity based on context: the expected similarity of prototypes
with respect to the posterior distributions given the contexts,

$$\mathrm{AvgSimC}(w_1, w_2) = \frac{1}{K_1 K_2} \sum_{k_1} \sum_{k_2} p(k_1 \mid w_1, C_1)\, p(k_2 \mid w_2, C_2)\, \cos(\mathrm{vec}(w_1, k_1), \mathrm{vec}(w_2, k_2)),$$

and the similarity of the most probable prototypes given the contexts,

$$\mathrm{MaxSimC}(w_1, w_2) = \cos(\mathrm{vec}(w_1, k_1), \mathrm{vec}(w_2, k_2)),$$

where $k_1 = \arg\max_k p(k \mid w_1, C_1)$ and $k_2 = \arg\max_k p(k \mid w_2, C_2)$, respectively. Here
$K_1$ and $K_2$ denote the numbers of learned prototypes for each of the words, and $C_1$, $C_2$ their
corresponding contexts. In AdaGram, $\mathrm{vec}(w, k) = \mathrm{In}_{wk}$, and the posterior distribution
over word senses is computed according to sec. 3.2. For word disambiguation AdaGram uses the 4 nearest
words in a context. Results for NP-MSSG and MSSG are taken from (Neelakantan et al., 2014), and results
for MPSG from (Tian et al., 2014).
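
Both contextual similarity measures can be sketched in a few lines. The sketch assumes that the prototype vectors of each word and the posterior probabilities of its senses given the context are already available, and it includes a constant normalization by the product of the prototype counts, which is an assumption here and has no effect on rank correlation. Function names and the toy vectors are made up for illustration.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def avg_sim_c(vecs1, post1, vecs2, post2):
    """Posterior-weighted similarity over all prototype pairs, divided by K_1 * K_2."""
    total = sum(
        p1 * p2 * cosine(v1, v2)
        for v1, p1 in zip(vecs1, post1)
        for v2, p2 in zip(vecs2, post2)
    )
    return total / (len(vecs1) * len(vecs2))


def max_sim_c(vecs1, post1, vecs2, post2):
    """Similarity of the most probable prototype of each word given its context."""
    k1 = max(range(len(post1)), key=post1.__getitem__)
    k2 = max(range(len(post2)), key=post2.__getitem__)
    return cosine(vecs1[k1], vecs2[k2])


# Word 1 has two prototypes, word 2 has one (made-up vectors and posteriors).
vecs_w1, post_w1 = [[1.0, 0.0], [0.0, 1.0]], [0.9, 0.1]
vecs_w2, post_w2 = [[1.0, 0.0]], [1.0]

avg_value = avg_sim_c(vecs_w1, post_w1, vecs_w2, post_w2)
max_value = max_sim_c(vecs_w1, post_w1, vecs_w2, post_w2)
print(avg_value, max_value)
```

For a model with a single prototype per word and no normalization constant other than one, the two measures reduce to the same cosine similarity.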

We also consider the original Skip-gram model as a baseline. We train two models: the first
with prototypes of dimensionality 300, based on hierarchical soft-max, and the second of
dimensionality 900, trained using negative sampling (Mikolov et al., 2013a) (the number of negative
samples is set to 5, and three iterations over the training data are made). The training data and other
parameters are identical to those used to train AdaGram in the main experiments. Note that for Skip-gram
the AvgSimC and MaxSimC measures coincide, because the model learns only one representation per word.


The results of the experiment are provided in Table 8. The NP-MSSG model of Neelakantan et al. (2014)
outperforms other models in terms of AvgSimC; however, one may see that the improvement over the
900-dimensional Skip-gram baseline is only marginal. Moreover, the latter is the second best model
despite ignoring the contextual information and hence being unable to distinguish between different
word meanings. This may suggest that SCWS is of limited use for evaluating multi-prototype
word representation models, as the ability to differentiate between word senses is not necessary to
achieve a good score. One may consider another example of an undesirable model which will not
be penalized by the target metric in the word similarity task. If a model learns too many
prototypes for a word, e.g. with very close vector representations, it is hardly usable in practice,
but as long as the averaged similarities between prototypes correlate with human judgements, such
non-interpretability will not be accounted for during evaluation. We thus consider word-sense induction
a more natural task for evaluation, since it explicitly accounts for a proper and interpretable mapping
from contexts into discovered word meanings.

Table 5: V-Measure for the word sense induction task on different datasets. Here we use the test subset
of the WWSI dataset.

MODEL                    SEMEVAL-2007   SEMEVAL-2010   SEMEVAL-2013   WWSI
MSSG.300D.30K            0.067          0.144          0.033          0.215
NP-MSSG.50D.30K          0.057          0.119          0.023          0.188
NP-MSSG.300D.6K          0.073          0.089          0.033          0.128
AdaGram.300D α = 0.15    0.114          0.200          0.192          0.326


Table 6: F-Score for the word sense induction task on different datasets. Here we use the test subset of
the WWSI dataset.

MODEL                    SEMEVAL-2007   SEMEVAL-2010   SEMEVAL-2013   WWSI
MSSG.300D.30K            0.528          0.492          0.437          0.632
NP-MSSG.50D.30K          0.496          0.488          0.392          0.621
NP-MSSG.300D.6K          0.557          0.531          0.419          0.660
AdaGram.300D α = 0.15    0.448          0.439          0.342          0.588


Table 7: Fuzzy Normalized Mutual Information and Fuzzy B-Cubed metric values for Task 13 of the
SemEval-2013 competition. See text for details.

MODEL                    FUZZY-NMI   FUZZY-B-CUBED
MSSG.300D.30K            0.070       0.287
NP-MSSG.50D.30K          0.064       0.273
NP-MSSG.300D.6K          0.063       0.290
AdaGram.300D α = 0.15    0.089       0.132


Table 8: Spearman’s rank correlation results for the contextual similarity task on the SCWS dataset.
Numbers are multiplied by 100.

MODEL                    AvgSimC   MaxSimC
MSSG.300D.30K            69.3      57.26
NP-MSSG.50D.30K          66.1      50.27
NP-MSSG.300D.6K          69.1      59.8
MPSG.300D                65.4      63.6
Skip-Gram.300D           65.2      65.2
Skip-Gram.900D           68.4      68.4
AdaGram.300D α = 0.15    61.2      53.8